1. Introduction


This report provides new users of the Divergence Measures Tool (DMTool) with a brief
overview of its core functionality and a hands-on tutorial with two files so that they can work
with the tool immediately. The DMTool at this stage is a work in progress: it supports our
in-house U.S. Army Research Laboratory (ARL) research, as initially reported in Jaja et al.
(2012).* An extended DMTool report that covers the implementation of the measurements and
recent extensions is forthcoming.†

The  initial  impetus  for  building  this  software  application  was  our  work  with  natural  language 
(NL)  text  classifiers.  As  we  began  examining  available  text  datasets  for  training  and  testing  the 
two-way  classifiers  that  we  wanted  to  build,  it  became  apparent  that  we  would  need  to  construct 
additional text corpora and calibrate them for “how dissimilar” they were from each other.‡ Since
we  did  not  know  a  priori  which  measures  would  be  adequately  sensitive  in  detecting  differences 
across  a  wide  range  of  Arabic-language  text  files  (they  varied  by  genre,  domain,  spelling 
variation,  size,  etc.),  our  initial  requirement  was  for  a  tool  that  would  support  this  calibration 
evaluation  task. 

As  a  result,  the  DMTool  runs  a  suite  of  seven  information-theoretic  divergence  measures  on 
given  pairs  of  user  files  (called  P  and  Q)  and  provides  users  with  both  the  resulting  scores  and 
five  list  views  of  the  file  terms  with  their  frequencies,  as  used  in  computing  the  scores.  Each  list 
view  can  toggle  between  an  alphabetic  and  frequency-based  ordering  of  all  terms  in  that  view. 
The  five  views  show  all  terms  and  their  frequencies:  in  the  P  file,  in  the  Q  file,  in  only  the  P  file 
and  not  the  Q  file,  in  only  the  Q  file  and  not  the  P  file,  and  in  both  the  P  and  Q  files. 

The  measures  all  compare — each  in  a  slightly  different  way — the  distribution  of  the  relative 
frequency  of  the  terms  in  the  files. §  The  basic  capability  of  the  DMTool  includes  seven 
divergence  measures:  Renyi,  Kullback-Leibler,  Bhattacharyya,  Jensen-Shannon,  Variational 
(also known as L1, taxicab, Manhattan, city-block), Euclidean (also known as L2), and Cosine
(Renyi,  1961;  Kullback  and  Leibler,  1951;  Bhattacharyya,  1943;  Singhal,  2001;  Lin,  1991; 
Huang,  2008).  For  mathematical  definitions  of  the  measures,  see  appendix  A. 


We  welcome  all  feedback  on  the  tool  so  that  we  can  continue  to  improve  it. 

† The Batch tab and Super-batch tab overviewed in section 7 have been extended and refined over the last year as part of our
ongoing collaboration with Dr. Terrence Moore (ARL).

‡ We followed the standard assumption in the field of computational linguistics that the more dissimilar the files, the better
the classifiers would perform in correctly categorizing the files.

§ These measures are string-based: they do not provide any deeper lexical, semantic, or conceptual analyses of the terms.
They simply count the terms and have no basis for understanding the meaning(s) of these terms, or their relation to each other.

These measures have been put to many uses in natural language processing (NLP). In the
evaluation  of  machine  translation  (MT)  engines,  for  example,  researchers  have  begun  to  estimate 
the  quality  of  the  MT  output  for  a  previously  unseen  input  text  by  calculating  the  divergence 
between  the  MT  training  data  and  the  new  input  text.  For  the  scenario  where  multiple  MT 
engines  trained  on  different  data  are  available  at  MT  runtime,  these  measures  may  help 
determine  which  engine  will  provide  the  best  results  for  a  new  text  set. 

The  tool  is  organized  by  tabs.  The  user  can  always  see  all  tabs  by  name,  like  a  traditional  filing 
system,  where  the  names  appear  individually  lined  up  across  the  top  of  the  interface.  Only  one 
tab  is  selected  and  open  at  a  time,  with  its  full  screen  and  functionality  available  to  the  user.  To 
change  the  view  from  one  tab  to  another,  the  user  can  simply  click  on  a  different  tab  to  open  that 
one  and  close  the  current  one.  (There  is  no  automated  workflow  between  tabs.) 

The Basic tab allows the user to upload two files, preprocess their text, visually compare the
terms in the two corpora, and calculate the divergence measures on the two corpora.

The  Help  tab  provides  ready  access  to  information  on  the  underlying  equations  and  definitions 
used  in  the  tool. 

Several  other  tabs  provide  support  for  other  use  cases.  We  include  in  this  introduction  an 
overview  of  the  Classifier  tab  that  was  the  original  use  case  motivating  the  construction  of  this 
tool. The functionality of the other tabs in the DMTool is described briefly in section 7 of this
report  and  will  be  covered  in  more  detail  in  the  forthcoming  extended  DMTool  technical  report. 
Specific  user  questions  are  addressed  in  appendix  D  of  this  report. 


2.  User  Manual  Conventions 


In  this  report,  we  adhere  to  the  following  typographical  conventions  to  distinguish  among 
different  types  of  information: 

•  Words  with  technical  definitions  (as  spelled  out  in  section  5)  are  italicized. 

• References to sub-sections in this user manual are bolded and italicized. In Microsoft
Word, all internal section references also work as links to the referenced section or
sub-section when clicked while holding down the Ctrl button.

•  References  to  labels  from  the  tool  are  bolded  and  capitalized. 

•  References  to  input  or  output  for  the  tool  are  displayed  in  Courier  New  typeface. 

To allow for ongoing documentation of the DMTool as it has evolved over time, we made the
following decisions in creating the screenshots:

• Many screenshots are cropped down to the portion of the screen layout most relevant to the
accompanying description. Cropped screenshots in figures in sections 6, 7, and 8 of the
manual correspond respectively to partial views of the Basic, Classifier, or Help tab
screens.

•  Later  versions  of  the  DMTool  have  more  tabs.  The  use  cases  that  these  tabs  support  are 
described  in  section  7,  with  details  for  running  these  tabs  in  the  extended  technical  report. 


3.  Requirements 


The  following  are  the  requirements  for  the  tool: 

•  This  tool  is  compatible  with  Windows  XP,  Windows  Vista,  and  Windows  7. 

•  Any  files  uploaded  into  the  tool  must  be  .txt  files  in  ASCII  or  UTF-8  format. 

•  This  tool  has  been  tested  on  English  and  Arabic  script**,  but  should  work  on  any  other 
language  with  ASCII  or  UTF-8  encoding. 

• For best results, the input text should have its punctuation tokenized, i.e., each
punctuation mark should be separated from the word it is attached to by an extra blank
space.†† The input text should also be lowercased, when applicable. Both of these
preprocessing options are available within the tool itself when the user uploads text (see
figure 2).

•  The  divergence  measure  calculations  are  intended  to  distinguish  the  distribution  of  words 
in  two  sets.  The  smaller  the  sets  being  compared,  the  less  likely  they  are  to  capture  the  true 
distributions  from  which  they  were  drawn.  We  caution  the  user  here  and  leave  it  to  each 
individual  to  determine  the  appropriate  set  sizes  for  their  application. 


4.  Installation  and  Setup 


Included  with  this  tool  should  be  four  files: 

• DMT.exe

• Sample 1.txt

• Sample 2.txt

** So far, this has included Modern Standard Arabic, Farsi, Dari, Urdu, and Pashto.
†† Exceptions to this approach may include preserving numeric expressions.

Place these files together in the folder where the tool is stored. To run the tool, simply double
click the “DMT.exe” file. A tutorial on Sample 1.txt and Sample 2.txt is provided in appendix C.


5.  Definitions 


The  following  are  definitions  used  in  this  manual  in  describing  the  tool: 

•  type:  unique  word  in  a  given  text 

• token: any instance of a word in a given text

• corpus (pl. corpora): a collection of written texts

6.  Basic  Tab 


The  Basic  tab  is  the  default  tab  that  opens  when  the  tool  is  started.  It  provides  for  the  following 
operations. 

6.1  Uploading  User  Files 

At  the  top  left  of  the  screen,  as  shown  in  figure  1,  is  where  one  may  upload  two  text  files  with 
the  option  to  name  them  each  with  a  shorter  label.  To  find  and  upload  the  files  from  a  computer, 
click  the  button  to  the  right  of  the  File  fields  and  browse  to  the  desired  files.  If  short  labels 
are  entered  into  the  Name  fields,  these  names  will  appear  (in  lieu  of  the  source  file  names)  in  the 
DMTool’s  output  reports  (see  section  6.7  Reports).  The  names  cannot  be  identical,  must  be  27  or 
fewer  characters,  and  cannot  include  the  following  symbols:  [  ]  *  ?  /  \  .  The  output  folder  to  hold 
any  reports  generated  by  the  DMTool  will  be  located,  by  default,  in  the  user’s  Documents  folder. 
This may be changed if desired, by clicking the “...” button to the right of the Output Folder
field, browsing until locating another folder to use instead, and then selecting that desired
folder’s name.

[Figure 1 screenshot: the Basic, Classifier, and Help tabs across the top; Corpus P and Corpus Q blocks, each with a Name field and a File field; Output Folder field showing \Documents.]

Figure  1.  Uploading  files  with  optional  short  names,  and  designating  path  for  output  files  (Basic  tab). 

6.2  Preprocessing  Text  in  User  Files 

To the right of the Basic tab screen (see also section 6.1 Uploading User Files) are the
Preprocessing options, as shown in figure 2. By default, these are unselected. There are two
options available:

• Tokenize Punctuation will separate any word-external punctuation from the word to
which it is attached; this ensures that a word directly before and/or directly after a comma,
period, quotation mark, or other punctuation will be recognized after preprocessing and
added to the count for that word elsewhere, and the punctuation itself will be counted
separately. Applying this process to word-external punctuation only avoids the
incorrect separation of word-internal punctuation, as would otherwise happen with “can’t”
becoming three tokens: can ' t. (For a full list of what is recognized as punctuation
by this tool, see appendix B.)

•  Convert  to  Lowercase  will  convert  all  characters  in  Roman  script  to  lowercase;  this 
ensures  that  The  and  the  are  counted  as  the  same  word.  (The  user  must,  however, 
determine  when  this  is  not  applicable,  such  as  when  the  Roman  script  is  the  Buckwalter 
transliteration  of  Arabic  script.) 
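
The two preprocessing options can be approximated outside the tool with the minimal sketch below. It is not the DMTool's implementation: the function names are hypothetical, and Python's `string.punctuation` (ASCII punctuation only) stands in for the tool's actual punctuation list, which is given in appendix B.

```python
import string

# Assumption: ASCII punctuation only; the tool's real list is in appendix B.
PUNCT = set(string.punctuation)


def tokenize_punctuation(text):
    """Split word-external punctuation off each whitespace-delimited token.

    Punctuation inside a word (the apostrophe in "can't") is left alone.
    """
    out = []
    for tok in text.split():
        start, end = 0, len(tok)
        while start < end and tok[start] in PUNCT:
            start += 1
        while end > start and tok[end - 1] in PUNCT:
            end -= 1
        out.extend(tok[:start])          # each leading mark becomes its own token
        if start < end:
            out.append(tok[start:end])   # the word itself, internal marks intact
        out.extend(tok[end:])            # each trailing mark becomes its own token
    return " ".join(out)


def convert_to_lowercase(text):
    """Lowercase the text. Note: str.lower() affects every cased script,
    not only Roman script, so this is broader than the tool's option."""
    return text.lower()
```

With this sketch, `The` and `the` fall together after lowercasing, and a word followed by a comma is counted with its other occurrences once the comma has been split off.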

[Figure 2 screenshot: Preprocessing box with checkboxes Convert to Lowercase, Tokenize Punctuation, Strip Punctuation, and Strip Stopwords.]

Figure 2. Altering input text before calculating measures (Basic tab).

6.3 Display Options for List Views

At  the  right  of  the  Basic  tab  screen,  below  the  Compare  Corpora  button  (see  section  6.4 
Compare  Corpora  Button),  there  are  three  checkboxes  with  options  for  text  terms  displayed  on 
the  screen  within  the  tool  (as  shown  in  figure  3): 

•  Right  to  Left  switches  the  text  direction  for  languages  such  as  Arabic;  this  option  is 
unselected  by  default. 

• Use Buckwalter decoding takes Arabic that has been written in Roman script using Tim
Buckwalter’s transliteration schema (Buckwalter, 2002) and converts it into Arabic
script;‡‡ this option is also unselected by default.

•  Automatically  Sort  by  Frequency  sorts  the  word  lists  by  the  type  frequencies;  this  option 
is  selected  by  default.  However,  the  user  may  wish  to  deselect  this  option  if  the  files  are 
especially  large  and  the  user  is  mainly  concerned  with  the  divergence  measures  because  the 
word  list  sorting  can  be  memory  intensive  and  slow  down  the  processing. 


[Figure 3 screenshot: Display Options box with checkboxes Right to Left, Use Buckwalter decoding, and Automatically Sort by Frequency.]


Figure  3.  Screen  display  options  in  list  views  (Basic  tab). 
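
Because Buckwalter decoding is a character-for-character substitution, it can be illustrated with a simple translation table. The sketch below is an assumption-laden illustration, not the tool's code: the function name is hypothetical, and the table covers only the hamza, the consonants, and the long vowels, omitting the diacritics and hamza-carrier letters of the full schema.

```python
# Partial Buckwalter-to-Arabic table (illustrative subset only).
_TABLE = {"'": "\u0621"}  # hamza
# A b p t v j H x d * r z s $ S D T Z E g  ->  U+0627 .. U+063A
_TABLE.update(zip("AbptvjHxd*rzs$SDTZEg", map(chr, range(0x0627, 0x063B))))
# f q k l m n h w Y y  ->  U+0641 .. U+064A
_TABLE.update(zip("fqklmnhwYy", map(chr, range(0x0641, 0x064B))))

BUCKWALTER = str.maketrans(_TABLE)


def decode_buckwalter(text):
    """Replace each mapped Roman character with its Arabic-script letter.

    Characters outside the partial table pass through unchanged.
    """
    return text.translate(BUCKWALTER)
```

The table also shows why lowercasing should not be applied to Buckwalter text: `H` and `h`, like `S` and `s`, stand for different Arabic letters.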

6.4  Compare  Corpora  Button 

After loading the files and adjusting any preprocessing or display options, click the
Compare Corpora button (figure 4) to run the back-end calculations and generate
the word list views and divergence measures (see section 6.5 Word Lists and section 6.6
Divergence Measures).


[Figure 4 screenshot: Compare Corpora button.]

Figure 4. Button to run divergence measures calculations (Basic tab).


‡‡ This step performs a character-for-character substitution.

6.5 Word Lists


After  loading  files,  the  three  blocks,  with  different  color  backgrounds  as  shown  in  figure  5, 
labeled  Corpus  P,  Corpus  Q,  and  Intersection  will  fill  with  words  from  the  uploaded  text. 
Within  the  top  blocks  for  Corpus  P  and  Corpus  Q,  there  are  two  panels,  or  list  views  —  one 
view  with  all  of  the  words  in  the  corpus  and  their  frequencies,  another  view  with  the  words  that 
occur  only  in  one  corpus  but  not  the  other.  As  indicated  by  the  name,  the  Intersection  block 
below  lists  only  words  that  occur  in  both  corpora. 


[Figure 5 screenshot: Corpus P and Corpus Q blocks above an Intersection block; each list view has Token and Freq columns, with # of Types and # of Tokens counts beneath it.]


Figure  5.  Results  in  different  list  views  (Basic  tab). 

At  the  bottom  of  each  list  view  are  two  summary  counts  for  that  set:  the  number  of  types  in  the 
set  and  the  number  of  tokens  in  the  set.  The  number  of  types  corresponds  to  the  number  of  rows 
in  the  list  view  above.  The  number  of  tokens  corresponds  to  the  sum  of  values  in  the  frequency 
column  in  the  list  view  above. 
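
The relationship between the list views and the two summary counts can be sketched as follows. This is an illustration only: the function names are hypothetical, terms are taken to be whitespace-delimited strings, and the Intersection view is shown with a (P frequency, Q frequency) pair per term, since this section does not specify how the tool displays frequencies for shared terms.

```python
from collections import Counter


def term_frequencies(text):
    """Count whitespace-delimited tokens; each distinct key is one type."""
    return Counter(text.split())


def list_views(p_counts, q_counts):
    """Build the five views: P, Q, P only, Q only, and Intersection."""
    only_p = {t: c for t, c in p_counts.items() if t not in q_counts}
    only_q = {t: c for t, c in q_counts.items() if t not in p_counts}
    both = {t: (p_counts[t], q_counts[t]) for t in p_counts if t in q_counts}
    return {
        "P": dict(p_counts),
        "Q": dict(q_counts),
        "P only": only_p,
        "Q only": only_q,
        "Intersection": both,
    }


def summary(counts):
    """Return (# of types, # of tokens) for a view mapping term -> frequency.

    Types = number of rows; tokens = sum of the frequency column.
    """
    return len(counts), sum(counts.values())
```

For the text `the cat sat on the mat`, the P view has five rows (types) whose frequencies sum to six (tokens).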

6.6 Divergence Measures


The bottom right box labeled Divergence Measures, as shown in figure 6, provides nine
divergence measure calculations: the seven measures, with the asymmetric Renyi and
Kullback-Leibler measures each calculated in both directions. The Renyi measure uses a
constant Alpha; this is set to 0.99 by default but can be any number greater than 0 except for 1.
All of the measures have been normalized as difference measures§§ that return 0 when the two
corpora are identical. The range of each measure is listed on the right side to help the user
interpret the numbers.


Figure  6.  Resulting  measures  (Batch  tab). 
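
To make the behavior of the measures concrete, the sketch below implements common textbook forms of the seven measures over relative term frequencies. It is not the tool's implementation: the definitions the DMTool uses are in appendix A and may differ in logarithm base, normalization, or the handling of terms that are missing from one file. Here the Kullback-Leibler, Renyi, and Jensen-Shannon measures are in bits, the Bhattacharyya measure is taken as the negative natural log of the Bhattacharyya coefficient, and Cosine is expressed as 1 minus cosine similarity; all of these choices are assumptions.

```python
import math


def relative_frequencies(p_counts, q_counts):
    """Align two term-count dicts on a shared vocabulary; normalize each to sum to 1."""
    vocab = sorted(set(p_counts) | set(q_counts))
    n_p, n_q = sum(p_counts.values()), sum(q_counts.values())
    p = [p_counts.get(t, 0) / n_p for t in vocab]
    q = [q_counts.get(t, 0) / n_q for t in vocab]
    return p, q


def kullback_leibler(p, q):
    """D(P||Q) in bits; infinite when Q lacks a term that P contains."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total


def renyi(p, q, alpha=0.99):
    """Renyi divergence of order alpha (alpha > 0, alpha != 1), in bits."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be greater than 0 and not equal to 1")
    if alpha > 1 and any(pi > 0 and qi == 0 for pi, qi in zip(p, q)):
        return math.inf
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0 and qi > 0)
    return math.log2(s) / (alpha - 1) if s > 0 else math.inf


def jensen_shannon(p, q):
    """Symmetric; always finite; range [0, 1] with base-2 logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kullback_leibler(p, m) + 0.5 * kullback_leibler(q, m)


def bhattacharyya(p, q):
    """Negative log of the Bhattacharyya coefficient."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(bc) if bc > 0 else math.inf


def variational(p, q):
    """L1 distance; range [0, 2]."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def euclidean(p, q):
    """L2 distance."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def cosine_distance(p, q):
    """1 - cosine similarity, so identical distributions score 0."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return 1.0 - dot / norm
```

Swapping the arguments of `kullback_leibler` or `renyi` generally changes the result, which is why those two measures are reported in both directions, while the other five are symmetric. In this sketch, an unsmoothed Kullback-Leibler score is infinite whenever P contains a term absent from Q; the finite scores that the tool reports for files with different vocabularies suggest that it handles this case differently.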

6.7  Reports 

On  the  far  right  side  of  the  screen  is  a  box  labeled  Reports  (figure  7).  The  button  in  this  box 
generates  Excel  spreadsheets  as  follows: 

•  Selecting  the  Word  Distribution  checkbox  and  then  clicking  the  Generate  Report  button 
will  create  a  spreadsheet  with  the  information  from  the  word  lists  (see  section  6.5  Word 
Lists). 


§§ This applies for the cases where the measures have been traditionally defined as similarity measures and the value of 1
indicates identity of the two sets.

• Selecting the Divergence Measures checkbox and then clicking the Generate Report
button will create a spreadsheet with the calculated divergence measures (see section 6.6
Divergence Measures).

•  Selecting  both  checkboxes  will  generate  both  spreadsheets. 


Figure  7.  Generating  results  in  summary  reports  (Basic  tab). 


6.8  Clear  Data  Button 

The Clear Data button (figure 8) located in the lower right corner of the screen serves to clear
all of the previously loaded data. This button ensures that no data from previous files is still
stored in memory.

[Figure 8 screenshot: Clear Data button.]

Figure 8. Button to remove all values in all fields on the screen and in memory (Basic tab).


7.  Use  Cases 


As  the  DMTool  has  evolved,  new  tabs  have  been  added  to  the  interface  to  support  new  use  cases. 
In  this  section,  we  describe  the  Classifier  tab  in  detail  and  then  provide  a  brief  overview  of  the 
other  use  cases  for  which  the  DMTool  has  been  augmented.  Details  for  running  these  latter  cases 
with  the  Batch  and  Super  Batch  tabs  are  not  necessary  for  getting  started  using  the  DMTool  and 
so  have  not  been  included  in  this  report,  but  will  appear  in  the  extended  technical  report. 

7.1  Classifier  Construction  and  Evaluation 

The  construction  of  a  two-way  classifier  requires  at  a  minimum  four  files:  two  as  training  sets 
representative  of  the  two  domains  during  the  classifier  development  stage  and  two  as  test  sets 
again  as  separate  representatives  of  the  two  domains  during  the  classifier  evaluation  stage.  The 
Classifier tab supports all pair-wise comparisons of these files so that the user can determine
how  to  select  and  partition  the  available  corpora  for  training  and  testing. 

The Classifier screen is accessed by selecting the tab at the top labeled Classifier. This screen
also supports the visualization of divergence scores within a corpus. Most of this screen contains
the same controls, buttons, and displays as described above for the Basic screen in
section 6 (specifically, Preprocessing, Display Options, Compare Corpora Button, Word Lists,
Divergence Measures, and Clear Data Button).

At the top of the Classifier screen (figure 9), the user uploads four files from two different
domains (Domain 0 and Domain 1). For each domain, there is one training file and one test file.
The domain names cannot be identical, must be 11 or fewer characters, and cannot include the
following symbols: [ ] * ? / \ . As with the Basic screen, the DMTool provides a default path and
name in the Output Folder field, where the user can either directly enter or browse to and select
the folder in which output results will be stored.


Figure  9.  Uploading  files  (Classifier  tab). 

7.1.1  Corpus  P  and  Corpus  Q  Selection 

After  all  four  files  are  loaded,  the  user  needs  to  select  the  two  to  be  compared  as  Corpus  P  and 
Corpus  Q,  as  is  done  in  the  Basic  tab. 

To  do  so,  there  is  a  drop-down  menu  for  each  corpus  just  above  the  word  lists,  as  shown  in 
figure  10.  These  menus  are  automatically  populated  after  the  load,  so  all  the  user  needs  to  do  is 
select  which  two  files  to  compare  at  any  given  time.  The  word  list  views,  as  well  as  the  measure 
calculations,  will  change  depending  on  the  selected  files. 


[Figure 10 screenshot: Corpus P drop-down menu, open, listing Training Data (Domain 0), Training Data (Domain 1), Test Data (Domain 0), and Test Data (Domain 1); Corpus Q drop-down menu set to Training Data (Domain 0); each corpus block has a Words list view and a Unique Words list view (not in Corpus Q, or not in Corpus P), with Token and Freq columns.]


Figure 10. Selecting the files to be compared (Classifier tab).

7.1.2 Reports


The Reports portion of the Classifier screen differs by one option from what has already been
introduced for generating reports on the Basic screen: there is a drop-down menu linked to the
Divergence Measures checkbox. The user can use this menu to generate a report (figure 11) for
either the current pair of files (selected by the drop-down menus described in section 7.1.1
Corpus P and Corpus Q Selection) or for all six possible combinations of different files
(Training Domain 0 - Training Domain 1, Test Domain 0 - Test Domain 1, Training Domain 0 -
Test Domain 0, Training Domain 1 - Test Domain 1, Training Domain 0 - Test Domain 1,
Training Domain 1 - Test Domain 0).


[Figure 11 screenshot: Reports box with Word Distribution and Divergence Measures checkboxes, a drop-down menu offering Current Pair and All Combinations, and the Generate Report button.]

Figure  11.  Generating  results  in  summary  reports  (Classifier  tab). 
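
The count of six follows from choosing 2 of the 4 loaded files without regard to order. A short sketch, using the file labels from this section purely as illustrative strings:

```python
from itertools import combinations

# The four files loaded on the Classifier screen.
FILES = [
    "Training Domain 0",
    "Training Domain 1",
    "Test Domain 0",
    "Test Domain 1",
]

# Every unordered pair of distinct files: C(4, 2) = 6 combinations.
PAIRS = list(combinations(FILES, 2))

for first, second in PAIRS:
    print(f"{first} - {second}")
```

The pairs are unordered, so each combination appears once. For the asymmetric measures, both directions would still need to be computed within each pair.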


7.2  Within  Corpus  Analysis 

In  the  process  of  constructing  MT  engines  that  would  translate  a  set  of  military  manuals  from 
Arabic  to  English,  we  created  numerous  subsamples  from  the  manuals  and  discovered  that  the 
resulting  MT  engines  yielded  a  wider  range  of  evaluation  scores  than  we  had  been  expecting. 
This  led  us  to  ask  how  widely  different  portions  of  the  manuals  varied  from  each  other.  After 
formulating  the  question  this  way,  we  realized  that  we  could  partition  the  English  side  of  the 
corpus  into  separate  files  and  then — by  extending  the  DMTool  with  a  Batch  tab  to  run  multiple 
pair-wise  comparisons — we  could  score  each  military  manual  partition  against  each  of  the  other 
partitions  and  inspect  how  much  their  lexical  content  varied. 

7.2.1  Uploading  Files 

The Batch tab was added to the later extended versions of the DMTool to support both this
within-corpus analysis and the more general NxM dataset comparisons. As shown in figure 12,
rather than using the Basic tab, the user can select the Batch tab and directly upload one
collection of files under the P Corpus Set (where P contains N files) and another collection
of files under the Q Corpus Set (where Q contains M files, possibly the same as in P if
running a within-corpus analysis).

[Figure 12 screenshot: the P corpus set list box, with File Name and Size columns, showing 15 files named Vol 1 & Vol 2 (Farsi)-10k-1.txt through Vol 1 & Vol 2 (Farsi)-10k-15.txt (88 KB to 98 KB each); the Self Compare checkbox; the Q corpus set list box; and the Batch Report box with its Output Folder field (C:\Download).]


Figure  12.  Uploading  files:  top  screen  before  upload,  bottom  screen  after  a  partitioned  upload  (Batch  tab). 

Instead  of  selecting  one  file  for  corpus  P  and  one  file  for  corpus  Q,  the  user  can  select  multiple 
corpora  for  P  and  for  Q  by  using  the  Add  button.  The  Remove  button  will  remove  all  of  the  files 
that  are  selected.  The  Select  All  button  will  select  all  of  the  files  listed  in  the  list  box.  When  the 
user  clicks  on  the  Compare  Corpora  button,  the  program  will  compare  all  of  the  pairwise 
combinations  possible  between  the  list  of  corpora  in  P  and  the  list  of  corpora  in  Q. 

Another  variation  on  a  within-corpus  analysis  is  provided  by  the  Self  Compare  checkbox.  This 
functionality  enables  the  user  to  perform  a  unique  “hold-one-out”  type  of  comparison.  These 
comparisons  are  run  between  each  individual  file  in  corpus  P  (each  as  a  “hold  out”)  against  one 
new file that is constructed “on-the-fly” by combining all of the other (“non-held out”) files in
corpus  P. 
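
The two comparison modes described above can be sketched as follows. This is a hedged illustration, not the Batch tab's code: the function names are hypothetical, each file is represented by a `Counter` of term frequencies, and the Variational (L1) measure stands in for whichever measures are being computed.

```python
from collections import Counter
from itertools import product


def variational(p_counts, q_counts):
    """L1 distance between the relative-frequency distributions of two count dicts."""
    n_p, n_q = sum(p_counts.values()), sum(q_counts.values())
    vocab = set(p_counts) | set(q_counts)
    return sum(abs(p_counts.get(t, 0) / n_p - q_counts.get(t, 0) / n_q) for t in vocab)


def batch_compare(p_set, q_set, measure=variational):
    """Score every (P file, Q file) pair: N x M comparisons."""
    return {
        (p_name, q_name): measure(p_counts, q_counts)
        for (p_name, p_counts), (q_name, q_counts) in product(p_set.items(), q_set.items())
    }


def self_compare(p_set, measure=variational):
    """Hold-one-out: score each file against the pooled counts of all other files.

    Requires at least two files, since the pooled set would otherwise be empty.
    """
    scores = {}
    for held_out in p_set:
        rest = Counter()
        for name, counts in p_set.items():
            if name != held_out:
                rest.update(counts)  # add this file's term counts to the pool
        scores[held_out] = measure(p_set[held_out], rest)
    return scores
```

Passing the same collection as both `p_set` and `q_set` gives the within-corpus case, including each file's comparison with itself, which scores 0.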

7.2.2  Results 

The Batch tab is very similar in functionality to the Basic tab but with only four measures:
Jensen-Shannon, Kullback-Leibler, Bhattacharyya, and Variational. Due to screen space
constraints, no individual word breakdowns are displayed. If desired, these can, of course, be
individually computed in the Basic tab.

Figure 13 shows the screen display with results for the four measures calculated on each pairwise
file comparison, following the upload in figure 12. (Notice, for example, that partitions 1 and 9
are the most divergent pair, with the highest scores in the ninth row within each column.)

(File names are abbreviated below: 10k-N stands for Vol 1 & Vol 2 (Farsi)-10k-N.txt.)

Files Compared      Jensen-Shannon   Kullback-Leibler   Bhattacharyya   Variational
10k-1 vs 10k-1      0.0000           0.0000             0.0000          0.0000
10k-1 vs 10k-2      0.4495           3.6661             0.4180          1.1746
10k-1 vs 10k-3      0.4239           3.3057             0.3912          1.1306
10k-1 vs 10k-4      0.4536           3.4932             0.4192          1.1752
10k-1 vs 10k-5      0.4532           3.8031             0.4203          1.1768
10k-1 vs 10k-6      0.3921           3.2032             0.3642          1.0656
10k-1 vs 10k-7      0.3811           3.5193             0.3572          1.0196
10k-1 vs 10k-8      0.5053           4.2754             0.4789          1.2468
10k-1 vs 10k-9      0.5143           4.4592             0.4880          1.2610
10k-1 vs 10k-10     0.4920           4.3277             0.4653          1.2250
10k-1 vs 10k-11     0.4871           4.1641             0.4575          1.2176
10k-1 vs 10k-12     0.5111           4.3935             0.4848          1.2414
10k-1 vs 10k-13     0.4581           4.0116             0.4312          1.1660
10k-1 vs 10k-14     0.4619           4.0504             0.4360          1.1674
10k-1 vs 10k-15     0.4762           4.0495             0.4464          1.2110
10k-2 vs 10k-1      0.4495           3.0836             0.4180          1.1746
10k-2 vs 10k-2      0.0000           0.0000             0.0000          0.0000

Figure  13.  Results  of  file  comparisons  (Batch  tab). 

7.2.3  Measures 

The  Measures  area,  as  shown  in  figure  14,  will  display  the  Mean,  Minimum  value,  Maximum 
value,  and  the  Standard  Deviation  for  each  of  the  four  measures.  These  values  are  calculated 
over  the  whole  list  of  comparisons  shown  in  the  Results  window. 
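
The four summary values can be reproduced from a column of scores as sketched below. The function name is hypothetical, and the use of the population standard deviation is an assumption; this section does not say whether the tool uses the population or the sample form.

```python
import statistics


def summarize(scores):
    """Mean, minimum, maximum, and standard deviation of one measure's scores.

    Assumption: population standard deviation (statistics.pstdev);
    use statistics.stdev for the sample form instead.
    """
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "max": max(scores),
        "stdev": statistics.pstdev(scores),
    }


# Three Jensen-Shannon scores from figure 13 (partition 1 vs partitions 2, 3, and 4).
JS_SCORES = [0.4495, 0.4239, 0.4536]
JS_SUMMARY = summarize(JS_SCORES)
```

Note that if a batch includes self-comparisons, as in the first row of figure 13, those scores of 0 pull the Minimum to 0 and lower the Mean.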

[Figure 14 screenshot: Measures area, with one column per measure; the legible column headings are Jensen-Shannon and Kullback-Leibler.]


Figure  14.  Summary  results  (Batch  tab). 

7.2.4  Batch  Report 

The  Batch  Report  allows  the  user  to  select  an  output  directory  and  then  generate  a  report 
(figure  15).  The  report  is  an  Excel  spreadsheet  with  the  first  tab  containing  all  data  (as  shown  in 
the  Results  window  and  the  Measures  area)  and  the  subsequent  tabs  in  the  spreadsheet 
containing  a  file-by-file  listing  of  each  of  the  measures’  scores. 

Another  more  accessible  visualization  of  a  within-corpus  analysis  is  also  available  via  the  Batch 
tab.  The  user  can  select  the  Use  Color  Scales  checkbox  feature  to  generate  an  automated  color 
shading  of  the  file-by-file  comparison  scores  in  a  matrix,  as  shown  in  figure  16,  ranging  from  the 
lowest  scores  (most  similar  or  least  divergent  comparisons)  as  greenest  to  the  highest  scores 
(most  dissimilar)  as  the  reddest.  This  feature  is  only  available  with  Microsoft  Office  version 
2007  and  above. 
Figure 15. Generating summary results (Batch Report).

Figure 16. Summary of measures in color-coded matrix (Batch tab).

The colors in this figure, for example, suggest that the first 11 partitions and the last 12 partitions
are  more  distinct  from  each  other  than  they  are  alike.  The  ground  truth  for  this  corpus 
corroborates  this  interpretation:  these  two  sets  correspond  to  two  volumes  of  a  medical  textbook. 

Furthermore,  the  green  cell  that  appears  to  be  an  island  in  the  matrix  in  the  comparison  of  the  last 
partition  of  the  first  set  (filename  ends  -11)  and  the  last  partition  of  the  second  set  (filename  ends 
-24)  is  also  quite  revealing:  there  is  an  index  of  terms  at  the  end  of  the  first  textbook  volume  that 
contains  many  terms  in  the  index  at  the  end  of  the  second  textbook  volume. 

7.3 Assessment of Line-Aligned Parallel Files

Not  long  after  extending  the  DMTool  for  the  within-corpus  analysis  in  building  MT  engines,  we 
also realized that the DMTool could help us detect potential errors in the line-level alignments of
parallel source-to-target language files that are the training and test datasets for MT efforts. We
leave  it  as  an  exercise  for  the  eager  reader  to  discover  how  the  Super  Batch  tab,  as  described  in 
this  section,  supports  this  use  case. 

The  Super  Batch  tab  allows  the  user  first  to  select  two  corpora  and  have  the  DMTool 
automatically  generate  corpora  partitions,  either  in  terms  of  the  total  word  count  for  each 
partition  or  the  total  number  of  bins  for  partitioning  each  corpus,  and  then  to  compare  all  of  the 
pairwise  combinations  of  those  partitions.  As  with  the  Batch  tab,  the  DMTool  calculates  the 
same  four  divergence  measures  for  the  Super  Batch  tab:  Jensen-Shannon,  Kullback-Leibler, 
Kullback-Leibler  (reversed),  and  Variational. 

7.3.1  Uploading  Files 

The  Uploading  Files  block  (figure  17)  is  where  the  user  enters  the  paths  for  Corpus  P  and 
Corpus  Q  and  the  output  folder,  assigning  short  names  as  well  if  desired.  The  Output  Folder 
will  contain  all  of  the  Corpus  P  and  Corpus  Q  partition  files  that  are  created,  as  well  as  any 
reports  generated. 


Figure  17.  Uploading  files  (Super  Batch  tab). 


7.3.2  Construction  of  Subsets 


The  Subsets  block  (figure  18)  is  where  the  user  selects  the  size  of  the  subsets  to  be  created 
before  comparison  scores  are  calculated.  If  a  corpus  is  not  evenly  divisible  by  the  subset  size 
selected  for  the  partitions,  the  last  subset  created  at  the  end  of  the  corpus  will  have  fewer  words 
than  the  size  selected.  If  the  selected  subset  size  is  greater  than  the  size  of  the  full  corpus,  no 
subsets  will  be  created.  Alternatively,  the  user  can  select  the  number  of  bins  to  partition  the 
corpus  so  that  both  corpora  will  have  the  same  total  number  of  partitions.  These  two  methods  of 
subset  selection  are  mutually  exclusive.  The  user  can  do  one  or  the  other,  but  not  both. 
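The two partitioning modes described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the DMTool's own code: the function names `partition_by_size` and `partition_by_bins` are hypothetical, the corpus is assumed to be already tokenized into a list, and the near-equal split used for bins is an assumption, since the report does not say how the tool divides leftover tokens among bins.

```python
from typing import List


def partition_by_size(tokens: List[str], size: int) -> List[List[str]]:
    """Split tokens into consecutive subsets of `size` tokens each.

    The last subset may be shorter than `size`. If `size` is greater
    than the corpus length, no subsets are created.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if size > len(tokens):
        return []
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def partition_by_bins(tokens: List[str], bins: int) -> List[List[str]]:
    """Split tokens into exactly `bins` consecutive subsets.

    Assumed scheme: subset lengths differ by at most one token, with
    the longer subsets placed first.
    """
    if bins <= 0:
        raise ValueError("bins must be positive")
    base, extra = divmod(len(tokens), bins)
    subsets, start = [], 0
    for b in range(bins):
        end = start + base + (1 if b < extra else 0)
        subsets.append(tokens[start:end])
        start = end
    return subsets
```

With a 12-token corpus, `partition_by_size(tokens, 5)` yields subsets of 5, 5, and 2 tokens, while `partition_by_bins(tokens, 5)` always yields five subsets, which is what allows two corpora of different lengths to end up with the same number of partitions.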

[Screenshot: the Subsets (# of Tokens) checkboxes for 5k, 10k, 25k, 50k, 100k, 250k, 500k, and 1,000k; the # of Bins field; and the Compare Corpora button.]

Figure  18.  Construction  set  size  for  subsets  (Super  Batch  tab). 
7.3.3 Status


The  Status  window  (figure  19)  displays  the  progress  of  the  backend  code  performing  the 
divergence  calculations  as  it  is  executing  in  real  time.  Status  messages  provide  updates  as 
individual  subsets  are  created  and  file  comparisons  are  calculated.  The  completion  of  the  Super 
Batch  process  generates  a  status  message  as  well  as  an  audible  signal  to  alert  the  user  that  the 
process  has  finished  (assuming,  of  course,  that  the  sound  on  the  user’s  computer  is  not  muted). 


Figure  19.  Status  log  window  (Super  Batch  tab). 


7.3.4  Scores 

The  Scores  block  is  where  the  individual  divergence  measure  scores  are  displayed  for  each 
pairwise  comparison  for  four  of  the  divergence  measures:  Jensen-Shannon,  Kullback-Leibler, 
Bhattacharyya,  and  Variational  (figure  20). 

The  Data  View  drop-down  menu  box,  located  on  the  upper  right  of  the  Scores  block,  allows  the 
user  to  choose  the  score  types  to  display,  with  the  options  being  Mean,  Median,  Min-Max, 
Standard  Deviation,  and  #  of  Comparisons.  The  Mean  option  shows  the  average  of  all  pairwise 
comparison  scores  in  each  cell  for  the  given  partition  size  pairs.  The  Median  option  shows  the 
value  at  the  middle  of  the  distribution  of  the  scores  in  each  cell.  If  the  number  of  comparison 
scores  is  even,  the  middle  two  scores  are  averaged.  The  Min-Max  option  shows  the  lowest  and 
highest divergence scores in each cell for the given partition size pairs. The Standard Deviation
option shows the standard deviation of the scores in each cell for the given partition size pairs. The # of Comparisons
option  shows  the  number  of  pairs  in  each  cell.  The  user  also  has  a  selection  box,  seen  in  blue,  for 
each  divergence  measure,  so  that  all  individual  pairwise  scores  computed  for  a  particular  cell 
may  be  displayed  separately  as  well  in  the  Cell  Details  block,  as  described  in  the  next  subsection 
of  this  report. 
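As a rough illustration of the Data View options, the sketch below computes the five summaries for the scores in one cell. The function name `summarize_cell` and the sample scores are hypothetical. The report does not say whether the tool uses the population or the sample standard deviation; the population form is assumed here.

```python
import statistics
from typing import Dict, List


def summarize_cell(scores: List[float]) -> Dict[str, object]:
    """Summarize the pairwise scores of one cell (scores must be non-empty)."""
    return {
        "Mean": statistics.mean(scores),
        # statistics.median averages the two middle scores when the
        # number of scores is even, as described above.
        "Median": statistics.median(scores),
        "Min-Max": (min(scores), max(scores)),
        # Assumption: population standard deviation.
        "Standard Deviation": statistics.pstdev(scores),
        "# of Comparisons": len(scores),
    }


# Illustrative scores only; these are not taken from the figures.
summary = summarize_cell([0.40, 0.50, 0.45, 0.55])
```

For these four illustrative scores, the median is the average of 0.45 and 0.50, and the Min-Max entry is the pair (0.40, 0.55).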
Figure 20. Scoring summary results (Super Batch tab).

7.3.5  Score  Cells  in  Detail 

Given  that  the  summary  result  table  collapses  numerous  individual  comparisons,  the  Cell  Details 
block  enables  the  user  to  drill  down  and  examine  the  individual  divergence  measure  scores  that 
went  into  the  summary  statistics  for  each  of  the  selected  subset  comparison  cells  in  the  Scores 
block.  An  example  of  the  detailed  breakout  is  shown  in  figure  21  for  the  summary  cells  of  P 
subsets  of  size  25K  tokens  scored  against  the  Q  subsets  of  size  25K  tokens  (for  cells  colored  in 
blue  in  figure  20). 


Figure  21.  Cell-level  details  for  pair-wise  scoring  results  (Super  Batch  tab). 


7.3.6  Reports 

The  Reports  block  of  the  Super  Batch  screen,  as  with  the  other  screens,  allows  the  user  to  select 
the  type  of  report  to  be  created  and  then  to  generate  that  report  with  the  Generate  Report  button 
(figure  22).  The  two  types  of  reports  are  both  generated  in  Excel  spreadsheets.  The  Summary 
report  shows  exactly  what  is  displayed  in  the  Scores  block.  The  Cell  Details  report  will  show 
what  is  displayed  in  the  Cell  Details  block,  but  only  for  those  comparison  scores  from  the  subset 
comparison  (cell)  selected  in  the  Scores  block.  If  both  report  types  are  selected,  only  one  Excel 
spreadsheet  will  be  created,  with  the  Summary  data  on  the  first  tab  and  the  Cell  Details  on  the 
second  tab  of  the  same  saved  spreadsheet. 

Figure  22.  Generating  reports  (Super  Batch  tab). 


8.  Help  Tab 


The  Help  tab  contains  basic  information  on  the  tool,  including  the  following  two  screen-internal 
tabs. 


8.1  Assumptions  and  Definitions 

This  is  the  information  on  the  assumptions  and  definitions  as  used  in  implementing  the  equations 
computationally.  This  includes  the  information  from  section  5  and  appendix  A. 

8.2  Equations 

These  are  the  underlying  mathematical  equations  for  the  measures.  See  appendix  A. 


9.  Conclusion 


The  current  version  of  the  DMTool  allows  for  basic  functionalities  for  comparing  text  corpora  on 
similarity  measures.  The  following  functionalities  are  under  consideration  for  inclusion  in  future 
versions  of  this  tool: 

• Reading in files with UTF-16 encoding.

•  Automatically  generating  a  Venn  diagram  comparing  type  counts  of  the  compared  files. 

•  Calculating  the  perplexity  of  each  corpus. 

•  Eliminating  current  error  when  opening  reports  with  Microsoft  Excel  2007. 

•  Inputting  a  personalized  punctuation  list  for  the  purpose  of  tokenization. 

•  Calculating  the  measures  and  word  lists  by  n-grams. 
Appendix A. Divergence Measure Equations


The  following  details  the  divergence  measure  equations. 

Let P be one text corpus and Q be another text corpus.

Let P_types be the set of unique types in P and Q_types be the set of unique types in Q. Then let N be the number of unique types in the union of sets P and Q:

$$N = \left| P_{\text{types}} \cup Q_{\text{types}} \right| \tag{A-1}$$

For every unique type a(i), let p(i) be its relative frequency in P,

$$p(i) = \frac{\#\text{occurrences of token } a(i) \text{ in } P}{\text{total } \#\text{tokens in } P} \tag{A-2}$$

and let q(i) be its relative frequency in Q,

$$q(i) = \frac{\#\text{occurrences of token } a(i) \text{ in } Q}{\text{total } \#\text{tokens in } Q} \tag{A-3}$$

Using these definitions, the equations for the divergence measures are as follows:


$$\text{Renyi}(P;Q;\alpha) = \frac{1}{\alpha - 1} \log \sum_{i=1}^{N} p(i)^{1-\alpha}\, q(i)^{\alpha} \tag{A-4}$$


$$\text{Kullback-Leibler}(P;Q) = \sum_{i=1}^{N} p(i) \log \frac{p(i)}{q(i)} \tag{A-5}$$


Bhattacharyya 


i  v 

(P;Q)= 


(A-6) 


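$$\text{Jensen-Shannon}(P;Q) = \frac{1}{2} \left[ \sum_{i=1}^{N} p(i) \left( \log_2 p(i) - \log_2 \frac{p(i) + q(i)}{2} \right) + \sum_{i=1}^{N} q(i) \left( \log_2 q(i) - \log_2 \frac{p(i) + q(i)}{2} \right) \right] \tag{A-7}$$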


$$\text{Variational}(P;Q) = \sum_{i=1}^{N} \left| p(i) - q(i) \right| \tag{A-8}$$


$$\text{Euclidean}(P;Q) = \sqrt{\sum_{i=1}^{N} \bigl( p(i) - q(i) \bigr)^{2}} \tag{A-9}$$
$$\text{Cosine}(P;Q) = 1 - \frac{\sum_{i=1}^{N} p(i) \cdot q(i)}{\sqrt{\sum_{i=1}^{N} p(i)^{2}} \cdot \sqrt{\sum_{i=1}^{N} q(i)^{2}}} \tag{A-10}$$
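To make the definitions of N, p(i), and q(i) concrete, here is a minimal Python sketch of five of the measures (Kullback-Leibler, Jensen-Shannon, Variational, Euclidean, and Cosine). It is not the DMTool's implementation, and every function name is hypothetical. Several choices are assumptions: the natural logarithm is used for Kullback-Leibler, an unsmoothed Kullback-Leibler score is returned as infinity when a type in P is absent from Q, and Cosine is taken as one minus the cosine similarity so that identical files score 0. Both token lists are assumed to be non-empty.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def relative_frequencies(p_tokens: Sequence[str],
                         q_tokens: Sequence[str]) -> Tuple[List[float], List[float]]:
    """Return p(i) and q(i) over the union of the types in P and Q."""
    p_counts, q_counts = Counter(p_tokens), Counter(q_tokens)
    types = sorted(set(p_counts) | set(q_counts))  # N == len(types)
    p = [p_counts[t] / len(p_tokens) for t in types]
    q = [q_counts[t] / len(q_tokens) for t in types]
    return p, q


def kullback_leibler(p: List[float], q: List[float]) -> float:
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue         # a term with p(i) = 0 contributes nothing
        if qi == 0.0:
            return math.inf  # assumption: no smoothing is applied
        total += pi * math.log(pi / qi)
    return total


def jensen_shannon(p: List[float], q: List[float]) -> float:
    total = 0.0
    for pi, qi in zip(p, q):
        mid = (pi + qi) / 2.0
        if pi > 0.0:
            total += pi * (math.log2(pi) - math.log2(mid))
        if qi > 0.0:
            total += qi * (math.log2(qi) - math.log2(mid))
    return total / 2.0


def variational(p: List[float], q: List[float]) -> float:
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def euclidean(p: List[float], q: List[float]) -> float:
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def cosine(p: List[float], q: List[float]) -> float:
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return 1.0 - dot / norm
```

Under these assumptions, two files with no types in common reach the upper end of each bounded range: Jensen-Shannon gives 1, Variational gives 2, and Cosine gives 1, while Euclidean gives sqrt(2) in the extreme case where each file consists of a single type.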
Appendix B. Punctuation


The  following  is  a  list  of  the  punctuation  recognized  by  the  preprocessing  option  Tokenize 
Punctuation.  Any  punctuation  not  on  this  list  will  not  be  properly  split  off: 


PUNCTUATION NAME                  MARK   UNICODE
period                            .      002E
comma                             ,      002C
question mark                     ?      003F
colon                             :      003A
semi-colon                        ;      003B
exclamation mark                  !      0021
left parenthesis                  (      0028
right parenthesis                 )      0029
forward slash                     /      002F
back slash                        \      005C
left bracket                      [      005B
right bracket                     ]      005D
dash                              -      002D
equal sign                        =      003D
Arabic comma                      ،      060C
Arabic semi-colon                 ؛      061B
Arabic period                     ؞      061E
Arabic question mark              ؟      061F
Left Single Quotation Mark        ‘      2018
Single High-Reversed Quot Mark    ‛      201B
Right Single Quotation Mark       ’      2019
Left Double Quotation Mark        “      201C
Right Double Quotation Mark       ”      201D
Double High-Reversed Quot Mark    ‟      201F
Double Prime                      ″      2033
Reversed Prime                    ‵      2035
Reversed Double Prime             ‶      2036
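The effect of the Tokenize Punctuation option can be approximated with the short Python sketch below, which splits each mark in the list above off as its own token. The names `PUNCTUATION` and `tokenize_punctuation` are hypothetical, and the approach (padding each listed mark with spaces, then splitting on whitespace) is an assumption about the behavior rather than a description of the tool's code.

```python
# The 27 code points listed in this appendix.
PUNCTUATION = (
    ".,?:;!()/\\[]-="                        # ASCII marks
    "\u060C\u061B\u061E\u061F"               # Arabic marks
    "\u2018\u201B\u2019\u201C\u201D\u201F"   # quotation marks
    "\u2033\u2035\u2036"                     # primes
)


def tokenize_punctuation(text: str) -> list:
    """Split every listed mark off as a separate token.

    Marks that are not in PUNCTUATION stay attached to the
    neighboring word.
    """
    spaced = "".join(f" {ch} " if ch in PUNCTUATION else ch for ch in text)
    return spaced.split()


tokens = tokenize_punctuation("Hello, world (test).")
# tokens == ['Hello', ',', 'world', '(', 'test', ')', '.']
```

Any character outside this set, such as a currency symbol, is left attached to its word, which mirrors the caveat that unlisted punctuation is not split off.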

Appendix C. Tutorial


There are two example files included with this tool, named “Sample 1.txt” and “Sample 2.txt.”
This tutorial walks the user through the Basic tab with these files, both so that users can
confirm that the tool is working properly and so that they can understand how to run the tool with
their own files.


1. First, upload the files. Select the button labeled “...” to the right of the File fields and navigate
to the folder where the Divergence Measures Tool is stored, then select the example files.



2.  Now  name  the  files  by  filling  in  the  Name  fields. 

3. Click the Compare Corpora button.



The  results  should  look  like  this: 


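[Screenshot: the Corpus P, Intersection, and Corpus Q word lists with the Divergence Measures block. The block shows the Renyi alpha setting (0.99), Kullback-Leibler scores for P -> Q and Q -> P, the note “0 = Identical,” and a Range column listing 0 to ∞, 0 to ∞, 0-1, 0-1, 0-2, 0-sqrt(2), and 0-1.]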

Notice anything strange? We didn’t tokenize the punctuation or lowercase the text, so a single
word (the) shows up as three different types: the, The, and "The.
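The effect is easy to reproduce outside the tool. The Python sketch below uses a made-up sentence, not the contents of the sample files, and a hypothetical `preprocess` helper that lowercases the text and splits off only the straight double quote.

```python
from collections import Counter

# Made-up text for illustration; not taken from the sample files.
text = 'the cat saw The dog and "The bird'

# Without preprocessing, whitespace-separated strings are counted as is,
# so 'the', 'The', and '"The' are three different types.
raw_types = Counter(text.split())


def preprocess(text: str, marks: str = '"') -> list:
    """Lowercase the text and split the given marks off as tokens."""
    for mark in marks:
        text = text.replace(mark, f" {mark} ")
    return text.lower().split()


# After preprocessing, all three collapse into the single type 'the'.
clean_types = Counter(preprocess(text))
```

Here `raw_types` holds one occurrence each of the, The, and "The, whereas `clean_types` holds three occurrences of the plus one occurrence of the quotation mark as its own token.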

4.  To  fix  this,  click  the  Clear  Data  button,  then  repeat  steps  1  and  2. 
5. Select the Convert to Lowercase and Tokenize Punctuation checkboxes in the Preprocessing box.
6. Click the Compare Corpora button. Now the results should look like this:


[Screenshot: the same results view after preprocessing. The Corpus P word list, sorted by frequency, now begins with the (713) and reports 3,230 types and 14,020 tokens. The Unique Words (not in Corpus Q) list begins with afp (44), spain (19), and sanaa (18) and reports 1,887 types and 3,256 tokens. The Divergence Measures block is laid out as before.]
7. Select both checkboxes in the Reports box.


[Screenshot: the Unique Words (not in Corpus P) list, the Display Options box, and the Reports box with the Word Distribution and Divergence Measures checkboxes.]

8.  Click  the  Generate  Reports  button.  This  should  create  two  spreadsheets  in  the  Documents 
folder,  named  “Word  Distribution”  and  “Divergence  Measures.” 


[Screenshot: the Documents folder listing the generated Microsoft Office Excel 97-2003 worksheets, including “Basic Divergence Measures - Friday, July 22, 2011 11;47;14 AM.xls”.]
Appendix D. FAQ


Q:  Why  are  there  different  counts  for  the  words  The  and  the? 

A:  This  tool  is  case-sensitive;  if  the  user  wants  words  with  different  casing  to  be  counted 
together,  select  the  Convert  to  Lowercase  checkbox  in  the  Preprocessing  section  prior  to 
running  the  comparison. 

Q:  Why  are  there  different  counts  for  the  words  the  and  "the? 

A:  This  tool  looks  for  exact  matches  in  order  to  recognize  unique  words.  If  the  user  wants  to 
count  the  word  and  the  punctuation  preceding  or  following  it  separately,  select  the  Tokenize 
Punctuation  checkbox  in  the  Preprocessing  section  prior  to  running  the  comparison. 